
Conversation

@Arya-Hari

Make sure to read the contributing guidelines before submitting a PR

@github-actions github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels Sep 2, 2025
@jeffbolznv
Collaborator

This is just a combination of the 3 existing ops, right? Seems like we could handle this with fusion and not need a new op.
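For context, the three-op chain in question, written with the public ggml API — a minimal sketch, where the tensor name `kq` and the `n_past`/`n_embd_head` parameters are placeholders, not code from this PR:

```cpp
#include <math.h>
#include "ggml.h"

// Sketch of the three-op pattern this PR combines: scale the attention
// scores, apply the causal diagonal mask, then take the row-wise softmax.
static struct ggml_tensor * scaled_masked_softmax(
        struct ggml_context * ctx, struct ggml_tensor * kq,
        int n_past, int n_embd_head) {
    struct ggml_tensor * cur;
    cur = ggml_scale(ctx, kq, 1.0f / sqrtf((float) n_embd_head)); // GGML_OP_SCALE
    cur = ggml_diag_mask_inf(ctx, cur, n_past);                   // GGML_OP_DIAG_MASK_INF
    cur = ggml_soft_max(ctx, cur);                                // GGML_OP_SOFT_MAX
    return cur;
}
```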

@Arya-Hari
Author

@jeffbolznv Operator fusion is what I was going for, really, but I got confused along the way. Any suggestions on how to implement this with operator fusion? I'm new to this kind of work and don't have much experience, so any advice would be helpful. Thanks!

@jeffbolznv
Collaborator

Which backend(s) do you want to do it in?

If you search for "can_fuse" you can find some examples of fusion in several backends.
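As a rough sketch of the kind of check those backends perform, modeled on the `ggml_can_fuse` helper in `ggml-impl.h` — the wrapper function below is invented for illustration, not actual backend code:

```cpp
#include "ggml-impl.h" // for ggml_can_fuse

// Illustrative only: returns true when three consecutive graph nodes form
// the SCALE -> DIAG_MASK_INF -> SOFT_MAX chain and the intermediate
// results have no other consumers, so a backend may run one fused kernel
// in place of the three separate ones.
static bool can_fuse_scale_mask_softmax(const struct ggml_cgraph * cgraph, int node_idx) {
    static const enum ggml_op pattern[3] = {
        GGML_OP_SCALE, GGML_OP_DIAG_MASK_INF, GGML_OP_SOFT_MAX,
    };
    return ggml_can_fuse(cgraph, node_idx, pattern, 3);
}
```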

What kind of models have this pattern? I haven't seen it before.

@Arya-Hari
Author

Arya-Hari commented Sep 3, 2025

@jeffbolznv I'm actually trying to do the same thing that was done here, and then replicate it for the Vulkan backend to run on an Adreno 750 GPU on an Android phone.

@jeffbolznv
Collaborator

Does the Vulkan backend currently run on your phone? We've had mixed reports in the past (running into various driver bugs, it seems).

The Vulkan backend does have some support for fusion. To implement this, you'd probably need to add code to the soft max shader to conditionally apply the scale+diag_mask, and add the host side logic to select that shader.
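As a rough CPU-side reference for what such a conditional shader path would compute per row — all parameter names and flags here are illustrative, not the Vulkan backend's actual code:

```cpp
#include <math.h>
#include <float.h>
#include <stdbool.h>

// Reference for a fused soft_max kernel: optionally scale, optionally apply
// the causal diag mask (entries with column > n_past + row become -inf),
// then take a numerically stable softmax over the row.
static void fused_softmax_row(float * dst, const float * src, int n_cols,
                              int row_pos, int n_past, float scale,
                              bool do_scale, bool do_mask) {
    float max_val = -FLT_MAX;
    for (int j = 0; j < n_cols; j++) {
        float v = do_scale ? src[j] * scale : src[j];
        if (do_mask && j > n_past + row_pos) {
            v = -INFINITY; // diag_mask_inf
        }
        dst[j] = v;
        max_val = fmaxf(max_val, v);
    }
    float sum = 0.0f;
    for (int j = 0; j < n_cols; j++) {
        dst[j] = expf(dst[j] - max_val);
        sum += dst[j];
    }
    for (int j = 0; j < n_cols; j++) {
        dst[j] /= sum;
    }
}
```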

Is this combination of ops still interesting now that flash attention is broadly supported and is the default?

@Arya-Hari
Author

Arya-Hari commented Sep 3, 2025

@jeffbolznv The Vulkan backend does run on the phone, although the metrics aren't great: for most LLMs its decode rate is slower than using the CPU alone, which is why I've been trying to improve it. OpenCL also works, and performs better than Vulkan.

As for flash attention, is it supported for Qualcomm architectures? Would something like this be possible to test on a mobile Adreno 750?

@jeffbolznv
Collaborator

Flash attention ought to work on all devices. Please try `test-backend-ops -o FLASH_ATTN_EXT`.
